2026-06-30

Langevin Dynamics

Langevin dynamics is a [[Stochastic Differential Equation (SDE)|stochastic differential equation]] ([[Stochastic Differential Equation (SDE)|SDE]]) that describes the motion of a particle under the combined influence of a deterministic drift (gradient of a potential) and random thermal fluctuations. In machine learning, it serves as a sampling algorithm that generates samples from a target distribution $p (x)$ using only the [[Score Function|score function]] $\nabla_{x} \log p (x)$ — making it the foundational sampling mechanism behind score-based generative models and the corrector step in predictor-corrector diffusion samplers.

1. Core Concept

1.1 Physical Origin: Brownian Motion with Drift

Langevin dynamics originates from statistical physics, describing a Brownian particle in a potential field $U (x)$ :

m \frac{d^{2} x}{d t^{2}} = - \nabla U (x) - γ \frac{d x}{d t} + \sqrt{2 γ k_{B} T} ξ (t)

where:

$m$ : particle mass
$- \nabla U (x)$ : deterministic force from potential $U$
$- γ \frac{d x}{d t}$ : friction (dissipation)
$\sqrt{2 γ k_{B} T} ξ (t)$ : thermal fluctuations (white noise)
$ξ (t)$ : standard Gaussian white noise, $⟨ ξ (t) ξ (t^{'}) ⟩ = δ (t - t^{'})$

1.2 Overdamped Limit

In the overdamped limit (friction dominates, so the inertial term $m \frac{d^{2} x}{d t^{2}}$ is negligible), and in units where $γ = 1$ and $k_{B} T = 1$ , the equation reduces to the overdamped Langevin equation:

d x_{t} = - \nabla U (x_{t}) d t + \sqrt{2} d W_{t}

This is an [[Stochastic Differential Equation (SDE)|Itô SDE]] with:

Drift: $b (x) = - \nabla U (x)$ — follows the negative gradient of the potential
Diffusion: $σ (x) = \sqrt{2}$ — constant additive noise

1.3 From Physics to Sampling

Replace the physical potential $U (x)$ with the negative log-probability:

U (x) \equiv - \log p (x) ⟹ - \nabla U (x) = \nabla_{x} \log p (x)

This yields the Langevin sampling equation:

d x_{t} = \nabla_{x} \log p (x_{t}) d t + \sqrt{2} d W_{t}

The stationary distribution of this [[Stochastic Differential Equation (SDE)|SDE]] is exactly $p (x)$ — meaning as $t \to \infty$ , $x_{t} \sim p (x)$ .

2. Mathematical Foundation

2.1 Stationary Distribution

Theorem: The overdamped Langevin SDE

d x_{t} = \nabla_{x} \log p (x_{t}) d t + \sqrt{2} d W_{t}

has $p (x)$ as its unique stationary distribution under mild regularity conditions.

Proof sketch (via [[Fokker-Planck Equation|Fokker-Planck]]):

The [[Fokker-Planck Equation]] for this [[Stochastic Differential Equation (SDE)|SDE]] is:

\frac{\partial ρ_{t} (x)}{\partial t} = \nabla \cdot (- ρ_{t} \nabla \log p + \nabla ρ_{t})

Setting $\frac{\partial ρ_{t}}{\partial t} = 0$ and substituting $ρ_{\infty} = p$ :

\nabla \cdot (- p \nabla \log p + \nabla p) = \nabla \cdot (- \nabla p + \nabla p) = 0 ✓
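
As a numerical sanity check of the stationarity argument, the sketch below evaluates the stationary probability flux $J = p \nabla \log p - \nabla p$ for a one-dimensional standard Gaussian using central finite differences. The Gaussian target, the helper names (`log_p`, `central_diff`, `flux`), and the step `h` are illustrative assumptions, not part of the derivation above.

```python
import math

def log_p(x):
    """Log-density of a standard Gaussian (illustrative target)."""
    return -0.5 * x * x - 0.5 * math.log(2 * math.pi)

def p(x):
    return math.exp(log_p(x))

def central_diff(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def flux(x):
    """Stationary probability flux J = p * d/dx log p - d/dx p."""
    return p(x) * central_diff(log_p, x) - central_diff(p, x)

max_flux = max(abs(flux(x)) for x in (-2.0, -0.5, 0.0, 1.0, 3.0))
print(f"max |J| on test points: {max_flux:.2e}")
```

The flux vanishes up to finite-difference error, which is the identity $p \nabla \log p = \nabla p$ used in the proof sketch.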

2.2 Discrete-Time Approximation (Euler-Maruyama)

The continuous [[Stochastic Differential Equation (SDE)|SDE]] is discretized using the Euler-Maruyama scheme:

x_{k + 1} = x_{k} + η \nabla_{x} \log p (x_{k}) + \sqrt{2 η} z_{k}, z_{k} \sim N (0, I)

where $η$ is the step size.

import math

import torch


def langevin_dynamics(score_fn, x_init, n_steps, step_size):
    """
    Unadjusted Langevin Algorithm (ULA).
    
    Args:
        score_fn: Score function ∇_x log p(x)
        x_init:  Initial sample
        n_steps: Number of Langevin steps
        step_size: Step size η
    
    Returns:
        Sample approximately from p(x)
    """
    x = x_init.clone()
    
    for k in range(n_steps):
        score = score_fn(x)
        noise = torch.randn_like(x)
        x = x + step_size * score + math.sqrt(2 * step_size) * noise
    
    return x
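
A usage sketch may help here. The block below is a scalar, standard-library-only variant of the same update (so it runs without `torch`), applied to a hypothetical target $N (2, 0.5^{2})$ whose score is known in closed form. The names `ula_1d`, `gaussian_score`, `MU`, and `SIGMA` are assumptions made for illustration.

```python
import math
import random

def ula_1d(score_fn, x_init, n_steps, step_size, rng):
    """Scalar ULA: x <- x + eta * score(x) + sqrt(2 * eta) * z."""
    x = x_init
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        x += step_size * score_fn(x) + math.sqrt(2 * step_size) * z
    return x

MU, SIGMA = 2.0, 0.5  # hypothetical target N(2, 0.5^2)

def gaussian_score(x):
    """Score of N(MU, SIGMA^2): d/dx log p(x) = -(x - MU) / SIGMA^2."""
    return -(x - MU) / SIGMA**2

rng = random.Random(0)
# 2000 independent chains, each started at 0 and run for 200 steps
samples = [ula_1d(gaussian_score, 0.0, 200, 0.01, rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean ~ {mean:.3f}, variance ~ {var:.3f}")
```

The empirical mean and variance should land close to 2 and 0.25, with a small upward bias in the variance caused by the finite step size.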

2.3 Discretization Error

The Euler-Maruyama discretization introduces an $O (η)$ error in the stationary distribution. The Metropolis-adjusted Langevin algorithm (MALA) corrects this with an accept-reject step:

Algorithm	Acronym	Accept/Reject	Bias	Variance
Unadjusted Langevin	ULA	❌ No	$O (η)$	Lower
Metropolis-Adjusted	MALA	✅ Yes	Asymptotically unbiased	Higher (rejections)
Stochastic Gradient Langevin	SGLD	❌ No	$O (η)$	Lower (scalable)
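
The $O (η)$ bias of ULA can be made concrete on a target where everything is computable. For $p = N (0, 1)$ the score is $- x$ , the ULA update is linear, and the variance of the iterates follows an exact recursion. The sketch below iterates that recursion; the function name `ula_stationary_variance` and the chosen step sizes are illustrative assumptions.

```python
def ula_stationary_variance(eta, n_iter=10_000):
    """Iterate the exact variance recursion of ULA on a N(0, 1) target.

    With score(x) = -x the update is x <- (1 - eta) * x + sqrt(2 * eta) * z,
    so the variance obeys v <- (1 - eta)**2 * v + 2 * eta.
    """
    v = 1.0
    for _ in range(n_iter):
        v = (1 - eta) ** 2 * v + 2 * eta
    return v

for eta in (0.2, 0.1, 0.05):
    v = ula_stationary_variance(eta)
    closed_form = 1.0 / (1.0 - eta / 2.0)  # fixed point of the recursion
    print(f"eta={eta}: variance={v:.5f}, closed form={closed_form:.5f}, bias={v - 1:.5f}")
```

The stationary variance is $1 / (1 - η / 2)$ rather than 1, so halving $η$ roughly halves the bias. A Metropolis correction removes this bias for any fixed $η$ .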

2.4 Convergence Rate

When $p$ is strongly log-concave and smooth ( $- \log p$ is $μ$ -strongly convex and $L$ -smooth), the law $ρ_{t}$ of the discretized iterate at time $t = k η$ converges in Wasserstein-2 distance:

W_{2} (ρ_{t}, p) \leq W_{2} (ρ_{0}, p) e^{- μ t} + O (\frac{\sqrt{d} L}{μ} \sqrt{η})

Key takeaway: convergence is exponentially fast in continuous time, with discretization error $O (\sqrt{η d})$ .
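
To illustrate the two terms of the bound, the sketch below tracks the exact law of the ULA iterates for a one-dimensional Gaussian target $N (0, 1 / μ)$ , where the iterates stay Gaussian and the Wasserstein-2 distance has a closed form. The target, the initial law $N (5, 0.01)$ , and the helper names are assumptions chosen for illustration.

```python
import math

def w2_gaussians(m1, v1, m2, v2):
    """Wasserstein-2 distance between 1D Gaussians N(m1, v1) and N(m2, v2)."""
    return math.sqrt((m1 - m2) ** 2 + (math.sqrt(v1) - math.sqrt(v2)) ** 2)

def ula_w2_curve(mu, eta, m0, v0, n_steps):
    """W2 distance to N(0, 1/mu) of the exact law of each ULA iterate."""
    m, v, curve = m0, v0, []
    for _ in range(n_steps + 1):
        curve.append(w2_gaussians(m, v, 0.0, 1.0 / mu))
        # score(x) = -mu * x, so mean and variance follow linear recursions
        m = (1 - eta * mu) * m
        v = (1 - eta * mu) ** 2 * v + 2 * eta
    return curve

mu, eta = 2.0, 0.01
curve = ula_w2_curve(mu, eta, m0=5.0, v0=0.01, n_steps=2000)
for k in (0, 50, 200, 2000):
    contraction = curve[0] * math.exp(-mu * eta * k)
    print(f"k={k:5d}  W2={curve[k]:.4f}  contraction term={contraction:.4f}")
```

The distance first decays at the exponential rate and then levels off at a small positive floor set by the step size, which is the discretization term in the bound.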

3. Langevin Dynamics for Generative Modeling

3.1 Score-Based Sampling

The breakthrough insight of score-based generative modeling:

If we can learn $\nabla_{x} \log p (x)$ , we can sample from $p (x)$ via Langevin dynamics — without ever computing the normalization constant.

Score-Based Sampling Pipeline
═══════════════════════════════════════
Data x₀ ~ p_data(x)
  │
  ▼
Learn: s_θ(x) ≈ ∇_x log p_data(x)
  │
  ▼
Sample:  x_{k+1} = x_k + η·s_θ(x_k) + √(2η)·z_k
  │
  ▼
x_K ~ p_data(x)   (approximate, for large K, small η)
═══════════════════════════════════════
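
The claim that the normalization constant is never needed can be checked directly: multiplying a density by a constant adds a constant to $\log p$ , which the gradient removes. The sketch below compares finite-difference scores of an unnormalized two-component Gaussian mixture and a rescaled copy. The mixture, the constant `log_c`, and the helper names are illustrative assumptions.

```python
import math

def score_fd(log_density, x, h=1e-5):
    """Finite-difference estimate of d/dx log_density(x)."""
    return (log_density(x + h) - log_density(x - h)) / (2 * h)

def log_unnormalized(x):
    """Unnormalized two-component Gaussian mixture (illustrative)."""
    return math.log(math.exp(-0.5 * (x + 2) ** 2) + math.exp(-0.5 * (x - 2) ** 2))

def log_rescaled(x, log_c=7.3):
    """Same density multiplied by an arbitrary constant exp(log_c)."""
    return log_unnormalized(x) + log_c

xs = [-3.0, -1.0, 0.0, 0.5, 2.5]
gaps = [abs(score_fd(log_unnormalized, x) - score_fd(log_rescaled, x)) for x in xs]
print("largest score difference:", max(gaps))
```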

3.2 Annealed Langevin Dynamics

Problem: A score model trained only on clean data is inaccurate in low-density regions, and Langevin dynamics mixes slowly between modes separated by such regions.

Solution (NCSN) : Train at multiple noise levels $σ_{1} < σ_{2} < \dots < σ_{L}$ , then anneal:

def annealed_langevin(score_model, noise_levels, steps_per_level, step_size, shape):
    """
    Annealed Langevin dynamics (NCSN sampling).
    
    Args:
        score_model: Score model s_θ(x, σ)
        noise_levels: [σ_L, ..., σ_1] (largest to smallest)
        steps_per_level: Langevin steps per noise level
        step_size: Base step size η at the smallest noise level (η ∝ σ²)
        shape: Shape of the returned batch, e.g. (batch_size, *data_shape)
    """
    x = torch.randn(shape)  # Start from noise
    
    for sigma in noise_levels:  # From large to small noise
        # Scale the step size with the noise level: α = η · (σ / σ_1)²
        alpha = step_size * (sigma / noise_levels[-1]) ** 2
        
        for _ in range(steps_per_level):
            score = score_model(x, sigma)
            noise = torch.randn_like(x)
            x = x + alpha * score + math.sqrt(2 * alpha) * noise
    
    return x

Annealing schedule design:

Parameter	Typical Value	Rationale
$σ_{L}$ (max)	1.0 – 10.0	Large enough to cover data modes
$σ_{1}$ (min)	0.01	Small enough for precision
$L$ (levels)	10 – 50	Geometric progression: $σ_{i + 1} / σ_{i} = const$
Steps per level	10 – 100	Longer at smaller $σ$ for finer detail
$η$ (step size)	$η \propto σ^{2}$	Ensures stable dynamics at each scale
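
The schedule in the table can be generated in a few lines. The sketch below builds a geometric sequence of noise levels ordered from largest to smallest and the matching step sizes $η_{i} \propto σ_{i}^{2}$ . The function names and the numeric values ( $σ$ from 1.0 to 0.01, base step `2e-5`) are illustrative assumptions.

```python
def geometric_noise_levels(sigma_max, sigma_min, num_levels):
    """Noise levels from sigma_max down to sigma_min with a constant ratio."""
    ratio = (sigma_min / sigma_max) ** (1.0 / (num_levels - 1))
    return [sigma_max * ratio**i for i in range(num_levels)]

def step_sizes(noise_levels, base_step):
    """Per-level step size eta_i = base_step * (sigma_i / sigma_min)^2."""
    sigma_min = noise_levels[-1]
    return [base_step * (s / sigma_min) ** 2 for s in noise_levels]

levels = geometric_noise_levels(sigma_max=1.0, sigma_min=0.01, num_levels=10)
etas = step_sizes(levels, base_step=2e-5)
for s, e in zip(levels, etas):
    print(f"sigma={s:8.4f}  eta={e:10.6f}")
```

Scaling the step with $σ^{2}$ keeps the ratio $η_{i} / σ_{i}^{2}$ constant, so each level takes steps of the same size relative to its own noise scale.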

3.3 Correctors in Predictor-Corrector Framework

In [[Diffusion Model|diffusion models]], Langevin dynamics serves as the corrector that refines samples:

Predictor-Corrector Sampling Loop
═══════════════════════════════════════
For each timestep t = T → 1:
  
  1. PREDICTOR: Advance numerical ODE/SDE solver
     x_t → x_{t-1} (via Euler, DPM-Solver, etc.)
  
  2. CORRECTOR (Langevin): Refine sample using score
     For k = 1 to N_corrector:
       x_{t-1} = x_{t-1} + ε·∇_x log p_{t-1}(x) + √(2ε)·z
  
═══════════════════════════════════════

Why Langevin as corrector?

The predictor step may drift away from the true distribution
Langevin dynamics, given the exact score, converges toward the correct marginal distribution $p_{t-1}$ at that noise level
A few corrector steps significantly improve sample quality
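
The points above can be illustrated on a toy problem where every step is analyzable. Assume the data are $N (0, 1)$ and the noise is added with a variance-exploding schedule, so the marginal at noise level $σ$ is $N (0, 1 + σ^{2})$ with exact score $- x / (1 + σ^{2})$ . Both an Euler-type reverse-diffusion predictor and the Langevin corrector are then linear in $x$ , and the sample variance can be tracked in closed form. The schedule, the corrector step rule `eps = c * sigma^2`, and all names are assumptions made for this sketch.

```python
def geometric_levels(sigma_max, sigma_min, n):
    r = (sigma_min / sigma_max) ** (1.0 / (n - 1))
    return [sigma_max * r**i for i in range(n)]

def final_variance(levels, n_corrector, c=0.05):
    """Exact variance of x after predictor(-corrector) sampling for N(0, 1) data."""
    v = 1.0 + levels[0] ** 2  # start exactly at the noisiest marginal
    for s_hi, s_lo in zip(levels, levels[1:]):
        # Predictor: x <- x + delta * score(x, s_hi) + sqrt(delta) * z
        delta = s_hi**2 - s_lo**2
        v = (1 - delta / (1 + s_hi**2)) ** 2 * v + delta
        # Corrector: x <- x + eps * score(x, s_lo) + sqrt(2 * eps) * z
        eps = c * s_lo**2
        for _ in range(n_corrector):
            v = (1 - eps / (1 + s_lo**2)) ** 2 * v + 2 * eps
    return v

levels = geometric_levels(10.0, 0.01, 20)
target = 1.0 + levels[-1] ** 2
err_predictor_only = abs(final_variance(levels, n_corrector=0) - target)
err_with_corrector = abs(final_variance(levels, n_corrector=2) - target)
print(f"variance error, predictor only:    {err_predictor_only:.4f}")
print(f"variance error, 2 corrector steps: {err_with_corrector:.4f}")
```

In this toy setting the predictor alone overshoots the target variance at every level, and the corrector steps pull the variance back toward the marginal at each noise level.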

3.4 Comparison: Langevin vs. ODE vs. [[Stochastic Differential Equation (SDE)|SDE]] Sampling

Aspect	Langevin Dynamics	ODE (Probability Flow)	Reverse [[Stochastic Differential Equation (SDE)|SDE]]
Stochasticity	Stochastic	Deterministic	Stochastic
Score usage	$\nabla_{x} \log p (x)$ as drift	$\nabla_{x} \log p (x)$ in ODE drift	$\nabla_{x} \log p (x)$ in [[Stochastic Differential Equation (SDE)|SDE]] drift
Convergence guarantee	Yes ( $t \to \infty$ )	Path-dependent (fixed start)	Path-dependent
Step efficiency	Many steps needed	Fewer steps ([[DPM-Solver]])	Many steps
Quality	High (stochastic refinement)	Good (fast)	High
Role	Corrector, standalone sampler	Predictor	Predictor

4. Algorithmic Variants

4.1 MALA: Metropolis-Adjusted Langevin Algorithm

MALA adds a Metropolis-Hastings accept-reject step to remove discretization bias:

def mala_step(x, score_fn, log_prob_fn, step_size):
    """One step of the Metropolis-Adjusted Langevin Algorithm.

    Unlike ULA, MALA needs the (unnormalized) log-density, passed here as
    log_prob_fn, in addition to the score. x is a single, unbatched sample.
    """
    def log_q(x_to, x_from):
        # Log-density (up to a constant) of the Langevin proposal x_from -> x_to
        mean = x_from + step_size * score_fn(x_from)
        return -((x_to - mean) ** 2).sum() / (4 * step_size)

    # Propose a move via one Langevin step
    noise = torch.randn_like(x)
    x_proposed = x + step_size * score_fn(x) + math.sqrt(2 * step_size) * noise

    # Log-acceptance ratio: target ratio plus the correction for the
    # asymmetric Langevin proposal
    log_ratio = (log_prob_fn(x_proposed) - log_prob_fn(x)
                 + log_q(x, x_proposed) - log_q(x_proposed, x))

    # Accept with probability min(1, exp(log_ratio))
    if torch.log(torch.rand((), device=x.device)) < log_ratio:
        return x_proposed, True   # Accept
    else:
        return x, False           # Reject
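
For a self-contained illustration of the accept-reject correction, the sketch below runs a scalar MALA chain on an unnormalized $N (0, 1)$ target using only the standard library. The target, the step size `eta=0.5`, and the function names are assumptions for illustration. The proposal is Gaussian with mean $x + η \nabla \log p (x)$ and variance $2 η$ , which gives the `log_q` term.

```python
import math
import random

def log_p(x):
    """Unnormalized log-density of N(0, 1) (illustrative target)."""
    return -0.5 * x * x

def score(x):
    return -x

def log_q(x_to, x_from, eta):
    """Log-density (up to a constant) of the Langevin proposal x_from -> x_to."""
    mean = x_from + eta * score(x_from)
    return -((x_to - mean) ** 2) / (4 * eta)

def mala_chain(n_steps, eta, rng, x=0.0):
    samples, accepted = [], 0
    for _ in range(n_steps):
        prop = x + eta * score(x) + math.sqrt(2 * eta) * rng.gauss(0.0, 1.0)
        log_ratio = (log_p(prop) - log_p(x)
                     + log_q(x, prop, eta) - log_q(prop, x, eta))
        # Accept with probability min(1, exp(log_ratio))
        if math.log(1.0 - rng.random()) < log_ratio:
            x, accepted = prop, accepted + 1
        samples.append(x)
    return samples, accepted / n_steps

samples, acc_rate = mala_chain(20_000, eta=0.5, rng=random.Random(0))
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"acceptance={acc_rate:.2f}, mean={mean:.3f}, variance={var:.3f}")
```

Even with a step size this large, the chain's variance stays near 1, whereas unadjusted Langevin with the same step would settle at a noticeably inflated variance.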

4.2 SGLD: Stochastic Gradient Langevin Dynamics

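For large datasets, SGLD replaces the full-data gradient with a mini-batch estimate. With $x$ denoting the parameters, $y_{1}, \dots, y_{N}$ the data, and target posterior $p (x | y_{1 : N}) \propto p (x) \prod_{i} p (y_{i} | x)$ :

x_{k + 1} = x_{k} + η_{k} ( \nabla_{x} \log p (x_{k}) + \frac{N}{| B |} \sum_{i \in B} \nabla_{x} \log p (y_{i} | x_{k}) ) + \sqrt{2 η_{k}} z_{k}

where $B$ is a random mini-batch with $| B | ≪ N$ and the step sizes satisfy $η_{k} \to 0$ with $\sum η_{k} = \infty$ , $\sum η_{k}^{2} < \infty$ .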

Key properties:

Scalable to massive datasets
No accept-reject step (unadjusted)
Decreasing step size ensures convergence
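
A minimal sketch of these properties, assuming a toy Bayesian model that is not in the original text: observations $y_{i} \sim N (x, 1)$ with prior $x \sim N (0, 100)$ . The names `minibatch_score`, `step_size`, and `sgld`, and the schedule constants, are hypothetical choices for illustration.

```python
import math
import random

def minibatch_score(x, data, batch, prior_var=100.0):
    """Stochastic estimate of the posterior score from the indices in batch."""
    n = len(data)
    prior_term = -x / prior_var
    likelihood_term = (n / len(batch)) * sum(data[i] - x for i in batch)
    return prior_term + likelihood_term

def step_size(k, a=1e-3, b=10.0, gamma=0.55):
    """Polynomial decay a * (b + k)^(-gamma); for gamma in (0.5, 1] the sum of
    eta_k diverges while the sum of eta_k^2 converges."""
    return a * (b + k) ** (-gamma)

def sgld(data, n_steps, batch_size, rng, x=0.0):
    for k in range(n_steps):
        batch = rng.sample(range(len(data)), batch_size)
        eta = step_size(k)
        z = rng.gauss(0.0, 1.0)
        x += eta * minibatch_score(x, data, batch) + math.sqrt(2 * eta) * z
    return x

rng = random.Random(0)
data = [3.0 + rng.gauss(0.0, 1.0) for _ in range(1000)]
x_final = sgld(data, n_steps=2000, batch_size=32, rng=rng)
print(f"data mean={sum(data) / len(data):.3f}, SGLD iterate={x_final:.3f}")
```

Each step touches only 32 of the 1000 observations, yet the iterate settles near the data mean, where the posterior concentrates for this model.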

4.3 Underdamped Langevin Dynamics

Reintroducing momentum (kinetic Langevin) for faster mixing:

\begin{aligned} d x_{t} & = v_{t} d t \\ d v_{t} & = - γ v_{t} d t + \nabla_{x} \log p (x_{t}) d t + \sqrt{2 γ} d W_{t} \end{aligned}

Advantages over overdamped:

Faster convergence (momentum reduces random-walk behavior)
Better exploration of multi-modal distributions
Used in advanced MCMC samplers (Hamiltonian Monte Carlo is a related approach)

def underdamped_langevin_step(x, v, score_fn, gamma, step_size):
    """One step of underdamped (kinetic) Langevin dynamics."""
    noise = torch.randn_like(v)
    v_new = v + step_size * score_fn(x) - gamma * step_size * v \
            + math.sqrt(2 * gamma * step_size) * noise
    x_new = x + step_size * v_new
    return x_new, v_new

5. Connection to Key Concepts

5.1 Langevin Dynamics → [[Score Function]]

Langevin dynamics is the primary consumer of the [[Score Function]] in generative modeling:

\underbrace{x_{k + 1} = x_{k} + η \nabla_{x} \log p (x_{k}) + \sqrt{2 η} z_{k}}_{\text{Langevin dynamics requires}} \longleftarrow \underbrace{s_{θ} (x) \approx \nabla_{x} \log p (x)}_{\text{Score function provides}}

Without the [[Score Function]], Langevin dynamics cannot sample. Without a sampler such as Langevin dynamics, a learned score produces no samples. This complementary relationship makes them the original core pair of score-based generation.

5.2 Langevin Dynamics → [[Diffusion Model]]

In diffusion models, Langevin dynamics appears as:

Corrector step: Refines samples after predictor ODE/[[Stochastic Differential Equation (SDE)|SDE]] steps
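Ancestral sampling connection: each DDPM reverse step has the same form as a Langevin update (a score-driven drift plus Gaussian noise), applied to a target that changes with the timestep
Quality boost: In predictor-corrector samplers, even 1-2 corrector Langevin steps per timestep can noticeably improve FID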

5.3 Langevin Dynamics → [[Stochastic Differential Equation (SDE)]]

The overdamped Langevin equation is an Itô SDE:

d x_{t} = \underbrace{\nabla_{x} \log p (x_{t})}_{\text{drift } b (x, t)} d t + \underbrace{\sqrt{2}}_{\text{diffusion } σ (x, t)} d W_{t}

This connects to the general [[Stochastic Differential Equation (SDE)|SDE]] framework used in score-based generative models (Song et al., 2021), where different choices of $b$ and $σ$ define different diffusion processes.

5.4 Langevin Dynamics → [[Fokker-Planck Equation]]

The [[Fokker-Planck Equation]] provides the density-level description of Langevin dynamics:

\partial_{t} ρ_{t} = - \nabla \cdot (ρ_{t} \nabla \log p) + Δ ρ_{t}

This PDE describes how the ensemble distribution evolves — and proves that $p (x)$ is the stationary solution.

5.5 Langevin Dynamics → [[Wiener Process|Wiener Process]]

The noise term $d W_{t}$ in Langevin dynamics is the increment of a [[Wiener Process|Wiener Process]], the continuous-time limit of the accumulated discrete Gaussian noise $\sqrt{η} z_{k}$ with $z_{k} \sim N (0, I)$ . Without this noise, Langevin dynamics would collapse to deterministic gradient ascent on $\log p$ and converge to a mode instead of exploring the distribution.
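
The role of the noise can be seen by switching it off. The sketch below runs the discretized update on a $N (0, 1)$ target twice from the same starting points, once with and once without the Gaussian term. The `noise_scale` switch and the other names are illustrative assumptions.

```python
import math
import random

def langevin_1d(x, n_steps, eta, noise_scale, rng):
    """ULA on a N(0, 1) target; noise_scale=0 removes the Wiener increment."""
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        x += eta * (-x) + noise_scale * math.sqrt(2 * eta) * z
    return x

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
starts = [rng.uniform(-3, 3) for _ in range(500)]
no_noise = [langevin_1d(x, 500, 0.05, 0.0, rng) for x in starts]
with_noise = [langevin_1d(x, 500, 0.05, 1.0, rng) for x in starts]

print(f"variance without noise: {variance(no_noise):.2e}")
print(f"variance with noise:    {variance(with_noise):.3f}")
```

Without noise every chain contracts to the mode at 0, so the ensemble has essentially zero variance. With noise the ensemble keeps a variance close to that of the target.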

6. Practical Implementation

6.1 Complete Langevin Sampler

import math

import torch


class LangevinSampler:
    """Complete Langevin dynamics sampler with diagnostics."""
    
    def __init__(self, score_fn, step_size=1e-3, steps=100,
                 noise_annealing=True, clipping=1e3):
        self.score_fn = score_fn
        self.step_size = step_size
        self.steps = steps
        self.noise_annealing = noise_annealing
        self.clipping = clipping
    
    def sample(self, x_init, return_trajectory=False):
        x = x_init.clone()
        trajectory = [x.clone()] if return_trajectory else None
        
        for k in range(self.steps):
            score = self.score_fn(x)
            
            # Gradient clipping for stability
            score = torch.clamp(score, -self.clipping, self.clipping)
            
            # Optional: anneal noise
            if self.noise_annealing:
                eta = self.step_size * (1 - k / self.steps)
            else:
                eta = self.step_size
            
            noise = torch.randn_like(x)
            x = x + eta * score + math.sqrt(2 * eta) * noise
            
            if return_trajectory:
                trajectory.append(x.clone())
        
        return (x, trajectory) if return_trajectory else x
    
    def sample_multiple(self, n_samples, x_shape, device='cpu'):
        """Generate multiple independent samples."""
        x = torch.randn(n_samples, *x_shape, device=device)
        return self.sample(x)

6.2 Step Size Tuning

Symptom	Likely Cause	Fix
Diverging samples	Step size too large	Reduce $η$ , add gradient clipping
No mixing (stuck)	Step size too small	Increase $η$ , add momentum
Mode collapse	Insufficient noise	Use annealing schedule
High autocorrelation	Random-walk behavior of overdamped dynamics	Add momentum (kinetic Langevin)
Numerical instability	Poor score estimate	Gradient clipping, check score model
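
The first row of the table has a simple quantitative form for a Gaussian mode. If the target is $N (0, 1 / c)$ with curvature $c$ , the noise-free part of the update multiplies $x$ by $1 - η c$ at every step, so the iteration is stable only when $η < 2 / c$ . The sketch below shows both regimes; the curvature value and the function name are illustrative assumptions.

```python
def ula_mean_trajectory(x0, eta, curvature, n_steps):
    """Noise-free part of ULA on N(0, 1/curvature): x <- (1 - eta * curvature) * x."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append((1 - eta * curvature) * xs[-1])
    return xs

curvature = 100.0  # e.g. a narrow mode with variance 0.01; stability needs eta < 0.02
for eta in (0.005, 0.019, 0.025):
    final = ula_mean_trajectory(1.0, eta, curvature, 50)[-1]
    status = "stable" if abs(final) < 1.0 else "diverging"
    print(f"eta={eta}: |x_50|={abs(final):.3e} ({status})")
```

For a learned score the curvature is not known in advance, which is why clipping and conservative step sizes are common safeguards.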

6.3 Computational Complexity

For $d$ -dimensional data and $K$ steps:

Component	Cost per step	Total cost
Score evaluation	$O (d \cdot params)$	$K \times$ score cost
Noise generation	$O (d)$	$O (K d)$
State update	$O (d)$	$O (K d)$
Total	—	$O (K \cdot d \cdot params)$

The dominant cost is score model evaluation — in diffusion models, the [[U-Net]]/[[DiT]] forward pass for each Langevin corrector step.

7. Theoretical Properties

7.1 Reversibility and Detailed Balance

The overdamped Langevin [[Stochastic Differential Equation (SDE)|SDE]] is reversible with respect to $p (x)$ . This means the process satisfies detailed balance:

p (x) T (x \to y) = p (y) T (y \to x)

where $T$ is the transition kernel. Reversibility ensures that $p$ is indeed the invariant measure.
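
Detailed balance can be verified numerically when the transition kernel is known exactly. For the target $p = N (0, 1)$ the Langevin SDE is the Ornstein-Uhlenbeck process $d x_{t} = - x_{t} d t + \sqrt{2} d W_{t}$ , whose transition over time $t$ is Gaussian with mean $x e^{- t}$ and variance $1 - e^{- 2 t}$ . The sketch below checks the identity at a few arbitrary points; the function names and test points are assumptions for illustration.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def ou_kernel(x, y, t):
    """Exact transition density T_t(x -> y) of dx = -x dt + sqrt(2) dW."""
    rho = math.exp(-t)
    return normal_pdf(y, rho * x, 1 - rho**2)

def balance_gap(x, y, t):
    """|p(x) T(x -> y) - p(y) T(y -> x)| for the stationary law p = N(0, 1)."""
    forward = normal_pdf(x, 0.0, 1.0) * ou_kernel(x, y, t)
    backward = normal_pdf(y, 0.0, 1.0) * ou_kernel(y, x, t)
    return abs(forward - backward)

points = [(-1.0, 0.5, 0.1), (2.0, -0.3, 1.0), (0.2, 1.7, 3.0)]
gaps = [balance_gap(x, y, t) for x, y, t in points]
print("largest detailed-balance gap:", max(gaps))
```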

7.2 Ergodicity

Under mild conditions (positive density, smooth score, proper tails), Langevin dynamics is ergodic:

\lim_{T \to \infty} \frac{1}{T} \int_{0}^{T} f (x_{t}) d t = E_{x \sim p} [f (x)] \quad \text{a.s.}

This guarantees that time averages converge to ensemble averages — a crucial property for MCMC applications.
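
A small experiment illustrates the ergodic average. The sketch below follows a single path of the Langevin SDE for the target $N (0, 1)$ , simulated with the exact Gaussian transition of the Ornstein-Uhlenbeck process so that no discretization bias enters, and averages $f (x) = x^{2}$ along the path. The path length, time increment, and function name are illustrative assumptions.

```python
import math
import random

def ou_time_average(f, n_steps, dt, rng, x=0.0):
    """Time average of f along one path of dx = -x dt + sqrt(2) dW,
    simulated with the exact Gaussian transition over each interval dt."""
    rho = math.exp(-dt)
    scale = math.sqrt(1 - rho**2)
    total = 0.0
    for _ in range(n_steps):
        x = rho * x + scale * rng.gauss(0.0, 1.0)
        total += f(x)
    return total / n_steps

avg = ou_time_average(lambda x: x * x, n_steps=200_000, dt=0.1, rng=random.Random(0))
print(f"time average of x^2: {avg:.3f} (ensemble average under N(0, 1) is 1)")
```

A single long trajectory recovers the ensemble expectation, which is the property MCMC estimators rely on.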

7.3 Mixing Time

The mixing time (time to reach $ϵ$ -close in total variation) for a log-concave target:

τ_{mix} (ϵ) = O (\frac{1}{μ} \log (\frac{d}{ϵ}))

where $μ$ is the strong convexity constant. Non-log-concave targets can have exponentially worse mixing.

8. Comparison with Other Sampling Methods

Method	Gradient	Stochastic	Acceptance	Scaling	Best For
Langevin (ULA)	Score only	Yes	No	$O (K d^{2})$	Continuous, differentiable
MALA	Score + log-p	Yes	Yes	$O (K d^{2})$	Exact sampling, high-dim
HMC	Score + log-p	Yes (momentum resampling)	Yes	$O (K d^{2})$	Multi-modal, correlated
Gibbs	None	Conditional	No (always accepts)	$O (K d)$	Factorized conditionals
RW Metropolis	None	Yes	Yes	$O (K d^{2})$	Low-dim, non-diff.
Rejection	None	Yes	Yes	$O (\exp (d))$	Low-dim only

Langevin advantage: Only needs $\nabla_{x} \log p (x)$ — no normalization constant, no accept-reject needed (ULA). This makes it uniquely suited for score-based deep generative models.

9. Core Formula Cards

#	Formula	Meaning
1	$d x_{t} = \nabla_{x} \log p (x_{t}) d t + \sqrt{2} d W_{t}$	Overdamped Langevin [[Stochastic Differential Equation (SDE)|SDE]] (continuous)
2	$x_{k + 1} = x_{k} + η \nabla_{x} \log p (x_{k}) + \sqrt{2 η} z_{k}$	Euler-Maruyama discretization (ULA)
3	$\partial_{t} ρ_{t} = - \nabla \cdot (ρ_{t} \nabla \log p) + Δ ρ_{t}$	[[Fokker-Planck Equation]] for Langevin
4	$W_{2} (ρ_{t}, p) \leq W_{2} (ρ_{0}, p) e^{- μ t} + O (\sqrt{η d})$	Convergence rate (log-concave)
5	$⟨ ξ (t) ξ (t^{'}) ⟩ = δ (t - t^{'})$	White noise correlation ([[Wiener Process|Wiener Process]])
6	$- \nabla U (x) \equiv \nabla_{x} \log p (x)$	Physical potential ↔ probability connection

10. Summary

Langevin dynamics bridges statistical physics and deep generative modeling through a simple yet profound connection: the physical force $- \nabla U (x)$ becomes the [[Score Function]] $\nabla_{x} \log p (x)$ , and thermal noise becomes the exploration mechanism that ensures proper sampling.

Its three key roles in modern ML:

Role	Context	Significance
Standalone sampler	Score-based models (NCSN)	Generates samples from learned score without normalization
Corrector	Predictor-corrector diffusion	Refines samples, improves quality with 1-2 steps
Theoretical bridge	[[Stochastic Differential Equation (SDE)|SDE]] ↔ Density evolution	Links particle trajectories ([[Stochastic Differential Equation (SDE)|SDE]]) to distribution evolution (Fokker-Planck)

The equation itself is deceptively simple — $d x = \nabla \log p d t + \sqrt{2} d W$ — yet it unifies MCMC sampling, score-based generation, and nonequilibrium statistical mechanics under one framework.

Related Concepts

[[Score Function]]
[[Diffusion Model]]
[[Stochastic Differential Equation (SDE)]]
[[Fokker-Planck Equation]]
[[Wiener Process|Wiener Process]]
[[Probability Flow ODE]]
[[DDIM]]
[[DPM-Solver]]
[[Markov Process]]
[[Martingale]]
[[Metropolis-Hastings]]
[[Hamiltonian Monte Carlo]]

ChungMG

Mathematics & Machine Learning
